Open the notebook in JupyterLab for the optimal experience.
Some charts, diagrams, and interactive features are not visible in GitHub's notebook renderer. This is standard, well-known behaviour; while some workarounds are possible, there is no fix that covers all of the issues.
%reload_ext autoreload
%autoreload 1
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
import pandas as pd
import folium
from folium import plugins
import seaborn as sns
from functools import lru_cache
from itertools import product
from random import random, seed
from utils import edu_utils as u
from utils import auto_kaggle
from utils import chart_utils as charts
from utils import folium_utils
from utils import pandas_auto_types as pat
%aimport utils.edu_utils
%aimport utils.auto_kaggle
%aimport utils.chart_utils
%aimport utils.folium_utils
seed(100)
pd.options.display.max_rows = 100
u.check("done")
yes! ✅
# covid
dataset = "kimjihoo/coronavirusdataset"
csv = "SeoulFloating.csv"
auto_kaggle.download_dataset(dataset, csv)
Kaggle API 1.5.12 - login as 'edualmas' Dataset kimjihoo/coronavirusdataset Skipped
I created a small utility library (pandas_auto_types, aliased as pat) with custom functions to perform common tasks on datasets.
read_csv_kws = {"na_values": [" ", "-"]}
# Cases and Patients
raw_case = pat.read_csv(
"dataset/Case.csv", read_csv_kws, category_threshold_percent=0.3
)
raw_patientInfo = pat.read_csv(
"dataset/PatientInfo.csv",
read_csv_kws,
convert_dtypes_kws={"convert_integer": False},
).set_index("patient_id")
# Timelines and trends
raw_weather = pat.read_csv(
"dataset/Weather.csv", read_csv_kws, category_threshold_percent=0.01
)
raw_timeAge = pat.read_csv("dataset/TimeAge.csv", read_csv_kws)
raw_time = pat.read_csv("dataset/Time.csv", read_csv_kws)
raw_timeGender = pat.read_csv("dataset/TimeGender.csv", read_csv_kws)
raw_timeProvince = pat.read_csv("dataset/TimeProvince.csv", read_csv_kws)
raw_seoulFloating = pat.read_csv("dataset/SeoulFloating.csv", read_csv_kws)
# online searching trends
raw_searchTrend = pat.read_csv("dataset/SearchTrend.csv", read_csv_kws)
# Gov. Policies
raw_policy = pat.read_csv("dataset/Policy.csv", read_csv_kws)
# Country statistics
raw_region = pat.read_csv("dataset/Region.csv", read_csv_kws)
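pandas_auto_types is a custom library, so its internals aren't shown here; the following is a plausible sketch of what pat.read_csv might do (the function name comes from the source, but the defaults and heuristics are assumptions):

```python
import pandas as pd


def read_csv(path, read_csv_kws=None, category_threshold_percent=0.5,
             convert_dtypes_kws=None):
    # Hypothetical re-implementation sketch of pat.read_csv
    df = pd.read_csv(path, **(read_csv_kws or {}))
    # Use pandas' nullable dtypes (Int64, Float64, boolean, string)
    df = df.convert_dtypes(**(convert_dtypes_kws or {}))
    # Parse columns that look like dates
    for col in df.columns:
        if "date" in col.lower():
            df[col] = pd.to_datetime(df[col], errors="coerce")
    # Convert low-cardinality string columns to category
    for col in df.select_dtypes("string").columns:
        if df[col].nunique() / len(df) <= category_threshold_percent:
            df[col] = df[col].astype("category")
    return df
```

This would explain the mix of nullable, datetime, and category dtypes we see below.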
### Let's inspect all the datasets we have in memory:
all_raw_dfs = u.all_vars_with_prefix("raw_", locals())
all_raw_dfs.keys()
dict_keys(['raw_case', 'raw_patientInfo', 'raw_weather', 'raw_timeAge', 'raw_time', 'raw_timeGender', 'raw_timeProvince', 'raw_seoulFloating', 'raw_searchTrend', 'raw_policy', 'raw_region'])
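The all_vars_with_prefix helper is also part of our custom utils; a plausible minimal sketch, assuming it simply filters a namespace dict such as locals():

```python
def all_vars_with_prefix(prefix: str, namespace: dict) -> dict:
    # Collect the variables whose names start with the given prefix,
    # e.g. all the raw_* dataframes currently in scope
    return {name: value for name, value in namespace.items()
            if name.startswith(prefix)}
```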
Let's check the dtypes of our dataframes.
We're particularly interested in the datatypes that our custom functions determined for each column.
Let's do a quick inspection of all the raw_* dataframes
for name, df in all_raw_dfs.items():
print("#" * 5, name, "#" * (50 - len(name)))
print(df.dtypes, "\n")
##### raw_case ##########################################
case_id                    object
province                 category
city                     category
group                     boolean
infection_case             string
confirmed                   Int64
latitude                  Float64
longitude                 Float64
dtype: object

##### raw_patientInfo ###################################
sex                        category
age                        category
country                    category
province                   category
city                       category
infection_case             category
infected_by                  string
contact_number              Float64
symptom_onset_date   datetime64[ns]
confirmed_date       datetime64[ns]
released_date        datetime64[ns]
deceased_date        datetime64[ns]
state                      category
dtype: object

##### raw_weather #######################################
code                              Int64
province                       category
date                     datetime64[ns]
avg_temp                        Float64
min_temp                        Float64
max_temp                        Float64
precipitation                   Float64
max_wind_speed                  Float64
most_wind_direction               Int64
avg_relative_humidity           Float64
dtype: object

##### raw_timeAge #######################################
date         datetime64[ns]
time                  Int64
age                category
confirmed             Int64
deceased              Int64
dtype: object

##### raw_time ##########################################
date         datetime64[ns]
time                  Int64
test                  Int64
negative              Int64
confirmed             Int64
released              Int64
deceased              Int64
dtype: object

##### raw_timeGender ####################################
date         datetime64[ns]
time                  Int64
sex                category
confirmed             Int64
deceased              Int64
dtype: object

##### raw_timeProvince ##################################
date         datetime64[ns]
time                  Int64
province           category
confirmed             Int64
released              Int64
deceased              Int64
dtype: object

##### raw_seoulFloating #################################
date          datetime64[ns]
hour                   Int64
birth_year             Int64
sex                 category
province            category
city                category
fp_num                 Int64
dtype: object

##### raw_searchTrend ###################################
date           datetime64[ns]
cold                  Float64
flu                   Float64
pneumonia             Float64
coronavirus           Float64
dtype: object

##### raw_policy ########################################
policy_id                  object
country                  category
type                       string
gov_policy                 string
detail                     string
start_date         datetime64[ns]
end_date           datetime64[ns]
dtype: object

##### raw_region ########################################
code                                Int64
province                         category
city                               string
latitude                          Float64
longitude                         Float64
elementary_school_count             Int64
kindergarten_count                  Int64
university_count                    Int64
academy_ratio                     Float64
elderly_population_ratio          Float64
elderly_alone_ratio               Float64
nursing_home_count                  Int64
dtype: object
So far, we have used custom utility functions that automatically identify dtypes.
We still see, however, some minor issues (unexpected data types) that we want to fix manually:

- the time field is all '0' in some dataframes; it can be deleted
- time is inconsistent in others (0s or 16s, but only 1 entry per day); it can be dropped as well
- birth_year is Int64 but should be a category (age bucket)

As we fix each raw_ dataframe, we will create a copy to use for EDA (with the df_ prefix), and drop the raw_ one to save memory
patientInfo¶This dataframe needs a few fixes: adding an unknown category to infection_case, sex and age, and dealing with missing values (e.g. city: 1% of rows). Let's calculate the % of missing data for each column
pat.calculate_na_percent(raw_patientInfo)
sex                   21
age                   26
country                0
province               0
city                   1
infection_case        17
infected_by           73
contact_number        84
symptom_onset_date    86
confirmed_date         0
released_date         69
deceased_date         98
state                  0
dtype: int64
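calculate_na_percent is another custom helper; a sketch of what it presumably computes (the name comes from the source; the rounding behaviour is an assumption, chosen to match the integer percentages above):

```python
import pandas as pd


def calculate_na_percent(df: pd.DataFrame) -> pd.Series:
    # Share of missing values per column, expressed as whole percentages
    return (df.isna().mean() * 100).round().astype(int)
```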
infected_by has 73% NAs, but that's expected: this value is only relevant for the subset of patients with the "contact with patient" infection_case. We will keep it, as it will allow us to do contact tracing. contact_number and symptom_onset_date, on the other hand, are missing in 84% and 86% of rows, so we drop them.
raw_patientInfo.drop(["contact_number", "symptom_onset_date"], axis=1, inplace=True)
raw_patientInfo.head()
| sex | age | country | province | city | infection_case | infected_by | confirmed_date | released_date | deceased_date | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| patient_id | |||||||||||
| 1000000001 | male | 50s | Korea | Seoul | Gangseo-gu | overseas inflow | <NA> | 2020-01-23 | 2020-02-05 | NaT | released |
| 1000000002 | male | 30s | Korea | Seoul | Jungnang-gu | overseas inflow | <NA> | 2020-01-30 | 2020-03-02 | NaT | released |
| 1000000003 | male | 50s | Korea | Seoul | Jongno-gu | contact with patient | 2002000001 | 2020-01-30 | 2020-02-19 | NaT | released |
| 1000000004 | male | 20s | Korea | Seoul | Mapo-gu | overseas inflow | <NA> | 2020-01-30 | 2020-02-15 | NaT | released |
| 1000000005 | female | 20s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000002 | 2020-01-31 | 2020-02-24 | NaT | released |
Add an unknown category to infection_case, sex and age¶raw_patientInfo["sex"] = pat.coalesce_categorical(raw_patientInfo["sex"])
raw_patientInfo["age"] = pat.coalesce_categorical(raw_patientInfo["age"])
raw_patientInfo["infection_case"] = pat.coalesce_categorical(
raw_patientInfo["infection_case"]
)
raw_patientInfo["city"] = pat.coalesce_categorical(raw_patientInfo["city"])
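coalesce_categorical is a custom helper; a minimal sketch under the assumption that it folds missing values into an explicit catch-all category (the default label "unknown" is an assumption):

```python
import pandas as pd


def coalesce_categorical(s: pd.Series, fill_value: str = "unknown") -> pd.Series:
    # Make sure the series is categorical, register the catch-all
    # category if needed, then replace NAs with it
    s = s.astype("category")
    if fill_value not in s.cat.categories:
        s = s.cat.add_categories([fill_value])
    return s.fillna(fill_value)
```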
Let's validate that all rows marked as deceased have a deceased_date
has_deceased_date = raw_patientInfo["deceased_date"].notna()
is_marked_as_deceased = raw_patientInfo["state"] == "deceased"
deceased_matches = is_marked_as_deceased == has_deceased_date
u.check(len(deceased_matches[deceased_matches == False]) == 0)
no! ❌
It seems some rows are marked as deceased but are missing the deceased_date
raw_patientInfo[deceased_matches == False].loc[
:, ["confirmed_date", "deceased_date", "state"]
]
| confirmed_date | deceased_date | state | |
|---|---|---|---|
| patient_id | |||
| 1000000013 | 2020-02-16 | NaT | deceased |
| 1000000109 | 2020-03-07 | NaT | deceased |
| 1000000285 | 2020-03-19 | NaT | deceased |
| 1000000473 | 2020-03-31 | NaT | deceased |
| 1000000997 | 2020-06-08 | NaT | deceased |
| 1000001062 | 2020-06-11 | NaT | deceased |
| 1000001118 | 2020-06-14 | NaT | deceased |
| 1100000071 | 2020-02-28 | NaT | deceased |
| 1100000095 | 2020-03-13 | NaT | deceased |
| 1100000097 | 2020-03-13 | NaT | deceased |
| 6002000002 | 2020-02-22 | NaT | deceased |
| 6022000049 | 2020-03-15 | NaT | deceased |
We don't have a deceased date for everyone. We'll keep a small note of it and move on.
We can do the same analysis for released patients, with an added complexity: a patient can have a released_date while no longer being in state "released" (they might have transitioned to another state after being released).
This was not possible for deceased patients, which kept the previous check simple.
For the check we want to perform on released_date, we need a different boolean operation. Pandas has no implementation of the IMP operator (material implication), but we have created one in our utilities.
Note that the IMP operator is not commutative, so the order of conditions matters.
This will allow us to understand if there are any cases where patients are marked as released but don't have a release date (which would be a problem). We don't care about other conditions (has a release date, but is no longer marked as "released", etc..)
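Concretely, p IMP q is false only when p holds and q does not, so the rows violating the implication are exactly those with p & ~q. A sketch of what our assert_imp helper presumably returns (the helper name comes from the source; the exact signature is an assumption):

```python
import pandas as pd


def assert_imp(df: pd.DataFrame, p: pd.Series, q: pd.Series) -> pd.DataFrame:
    # Material implication p -> q is equivalent to (~p | q),
    # so the invalid rows are those where p is true and q is false
    return df[p & ~q]
```

Note the asymmetry: rows where q holds without p are fine, which is exactly the "released_date but no longer released" case we want to ignore.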
marked_as_released = raw_patientInfo["state"] == "released"
should_have_release_date = raw_patientInfo["released_date"].notna()
invalid_rows = u.assert_imp(
raw_patientInfo, marked_as_released, should_have_release_date
)
u.check(0 == len(invalid_rows))
print(f"{len(invalid_rows)} released patients without a release date")
invalid_rows.head()
no! ❌ 1350 released patients without a release date
| sex | age | country | province | city | infection_case | infected_by | confirmed_date | released_date | deceased_date | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| patient_id | |||||||||||
| 1000000015 | male | 70s | Korea | Seoul | Seongdong-gu | Seongdong-gu APT | <NA> | 2020-02-19 | NaT | NaT | released |
| 1000000018 | male | 20s | Korea | Seoul | etc | etc | <NA> | 2020-02-20 | NaT | NaT | released |
| 1000000020 | female | 70s | Korea | Seoul | Seongdong-gu | Seongdong-gu APT | 1000000015 | 2020-02-20 | NaT | NaT | released |
| 1000000022 | male | 30s | Korea | Seoul | Seodaemun-gu | Eunpyeong St. Mary's Hospital | <NA> | 2020-02-21 | NaT | NaT | released |
| 1000000023 | male | 50s | Korea | Seoul | Seocho-gu | Shincheonji Church | <NA> | 2020-02-21 | NaT | NaT | released |
It seems the dataset is simply missing some data. This might extend to other gaps as well; we will not make any further checks, but it does give us some understanding of how complete/cohesive the data is.
df_patientInfo = raw_patientInfo.copy()
del raw_patientInfo
Time columns for daily aggregates¶We have previously identified 4 dataframes that contain daily aggregates with additional slices (by region, gender, age group), where the time column is not meaningful (since there is only 1 row per day and slice)
A quick check finds that all rows have the same time value ("0")
print(pd.unique(raw_timeAge["time"]))
print(pd.unique(raw_timeGender["time"]))
print(pd.unique(raw_time["time"]))
print(pd.unique(raw_timeProvince["time"]))
<IntegerArray>
[0]
Length: 1, dtype: Int64
<IntegerArray>
[0]
Length: 1, dtype: Int64
<IntegerArray>
[16, 0]
Length: 2, dtype: Int64
<IntegerArray>
[16, 0]
Length: 2, dtype: Int64
The first two, we can drop without looking back. They only contain 0.
df_timeAge = raw_timeAge.drop("time", axis=1)
del raw_timeAge
df_timeAge.columns
Index(['date', 'age', 'confirmed', 'deceased'], dtype='object')
df_timeGender = raw_timeGender.drop("time", axis=1)
del raw_timeGender
df_timeGender.columns
Index(['date', 'sex', 'confirmed', 'deceased'], dtype='object')
The other two require one extra check, to make sure we're not accidentally deleting valuable data:

- raw_time has 1 row per day
- raw_timeProvince has 17 rows per day (1 per province × 17 provinces)

rows_per_day = raw_time["date"].value_counts()
u.check(0 == len(rows_per_day[rows_per_day != 1]))
yes! ✅
rows_per_day = raw_timeProvince["date"].value_counts()
u.check(0 == len(rows_per_day[rows_per_day != 17]))
yes! ✅
With this we confirm that no single day has duplicated entries and that we can easily drop those time columns as well
df_time = raw_time.drop("time", axis=1)
del raw_time
df_time.columns
Index(['date', 'test', 'negative', 'confirmed', 'released', 'deceased'], dtype='object')
df_timeProvince = raw_timeProvince.drop("time", axis=1)
del raw_timeProvince
df_timeProvince.columns
Index(['date', 'province', 'confirmed', 'released', 'deceased'], dtype='object')
seoulFloating¶We want to convert birth_year from a numerical series to a categorical one with a concrete natural ordering
df_seoulFloating = raw_seoulFloating.copy()
del raw_seoulFloating
df_seoulFloating["age"] = df_seoulFloating["birth_year"].astype("str") + "s"
df_seoulFloating = df_seoulFloating[
["date", "hour", "age", "sex", "province", "city", "fp_num"]
]
df_seoulFloating = pat.categorise_column_ordered(
df_seoulFloating, "age", ["20s", "30s", "40s", "50s", "60s", "70s"]
)
df_seoulFloating.head()
| date | hour | age | sex | province | city | fp_num | |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | 0 | 20s | female | Seoul | Dobong-gu | 19140 |
| 1 | 2020-01-01 | 0 | 20s | male | Seoul | Dobong-gu | 19950 |
| 2 | 2020-01-01 | 0 | 20s | female | Seoul | Dongdaemun-gu | 25450 |
| 3 | 2020-01-01 | 0 | 20s | male | Seoul | Dongdaemun-gu | 27050 |
| 4 | 2020-01-01 | 0 | 20s | female | Seoul | Dongjag-gu | 28880 |
u.check(
0 == df_seoulFloating["age"].isna().sum()
) # our categorical type captured all values
yes! ✅
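categorise_column_ordered presumably wraps pandas' ordered categoricals; a minimal sketch (assumed behaviour: values outside the given categories become NaN, which is exactly what the check above guards against):

```python
import pandas as pd


def categorise_column_ordered(df: pd.DataFrame, col: str,
                              categories: list) -> pd.DataFrame:
    # Ordered categoricals keep a natural ordering ("20s" < "30s" < ...)
    # and silently turn any value outside `categories` into NaN
    df = df.copy()
    df[col] = pd.Categorical(df[col], categories=categories, ordered=True)
    return df
```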
dataset_creation_date = df_seoulFloating["date"].max().strftime("%Y-%m-%d")
dataset_creation_date
'2020-05-31'
Policy¶raw_policy has empty cells, but they can be ignored. Most of them are in end_date, which likely means the policy was still in place when this dataset was created.
The other two are policy details that might not be applicable. There is no need to drop anything; the data makes sense with all of these NAs.
raw_policy.isna().sum()
policy_id 0 country 0 type 0 gov_policy 0 detail 2 start_date 0 end_date 37 dtype: int64
policy_without_enddate = raw_policy.loc[:, raw_policy.columns != "end_date"]
policy_without_enddate[policy_without_enddate.isna().sum(axis=1) > 0]
| policy_id | country | type | gov_policy | detail | start_date | |
|---|---|---|---|---|---|---|
| 50 | 51 | Korea | Technology | Self-Diagnosis App | <NA> | 2020-02-12 |
| 51 | 52 | Korea | Technology | Self-Quarantine Safety Protection App | <NA> | 2020-03-07 |
df_policy = raw_policy
del raw_policy
Weather¶This dataset has a few NA values
We could drop all data prior to 2020, but we're probably better off keeping it all, gaps included, in case we need to find trends across several years. We can ignore/drop data during the analysis phase
raw_weather.isna().sum()
code 0 province 0 date 0 avg_temp 15 min_temp 5 max_temp 3 precipitation 0 max_wind_speed 9 most_wind_direction 29 avg_relative_humidity 20 dtype: int64
raw_weather["date"] = pd.to_datetime(raw_weather["date"], infer_datetime_format=True)
raw_weather[raw_weather["date"] >= "2020-01-01"].isna().sum()
code 0 province 0 date 0 avg_temp 0 min_temp 0 max_temp 0 precipitation 0 max_wind_speed 0 most_wind_direction 1 avg_relative_humidity 0 dtype: int64
df_weather = raw_weather
del raw_weather
As we have been fixing the raw_ dataframes, we have been creating EDA-ready DFs.
The remaining raw dataframes were good enough to require no manual tweaking; let's create their EDA-ready copies and drop the remaining raw dfs.
u.all_vars_with_prefix("raw_", locals()).keys()
dict_keys(['raw_case', 'raw_searchTrend', 'raw_region'])
df_case = raw_case.copy()
del raw_case
df_searchTrend = raw_searchTrend.copy()
del raw_searchTrend
df_region = raw_region.copy()
del raw_region
u.check(0 == len(u.all_vars_with_prefix("raw_", locals()).keys()))
u.check(11 == len(u.all_vars_with_prefix("df_", locals()).keys()))
yes! ✅ yes! ✅
In this initial section of actual exploration, we want to get a general sense of the data, find general trends, and develop some hypotheses that we can validate later.
We will start from the most innocuous and simple and move towards the more complex, so we can progressively build up our contextual domain knowledge (around Korea as a country, COVID as an epidemic, and their interaction: spread vs weather patterns vs policies, etc.)
Region Dataset¶The first thing we want to do is get a general understanding of Korea. As an analyst with very little context, we want to get a general feel for the country.
df_region.head()
| code | province | city | latitude | longitude | elementary_school_count | kindergarten_count | university_count | academy_ratio | elderly_population_ratio | elderly_alone_ratio | nursing_home_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000 | Seoul | Seoul | 37.566953 | 126.977977 | 607 | 830 | 48 | 1.44 | 15.38 | 5.8 | 22739 |
| 1 | 10010 | Seoul | Gangnam-gu | 37.518421 | 127.047222 | 33 | 38 | 0 | 4.18 | 13.17 | 4.3 | 3088 |
| 2 | 10020 | Seoul | Gangdong-gu | 37.530492 | 127.123837 | 27 | 32 | 0 | 1.54 | 14.55 | 5.4 | 1023 |
| 3 | 10030 | Seoul | Gangbuk-gu | 37.639938 | 127.025508 | 14 | 21 | 0 | 0.67 | 19.49 | 8.5 | 628 |
| 4 | 10040 | Seoul | Gangseo-gu | 37.551166 | 126.849506 | 36 | 56 | 1 | 1.17 | 14.39 | 5.7 | 1080 |
province_analysis = df_region.copy()
region_elderly_order = (
province_analysis[["province", "elderly_population_ratio"]]
.groupby("province")
.mean()
.sort_values("elderly_population_ratio", ascending=False)
)
region_elderly = province_analysis[["province", "city", "elderly_population_ratio"]]
nursing_home_by_province = (
province_analysis[["province", "nursing_home_count"]]
.groupby("province")
.sum()
.sort_values("nursing_home_count", ascending=False)
)
nursing_home_order = nursing_home_by_province.index.drop("Korea")
f, (ax_elderly, ax_nursing) = plt.subplots(1, 2, figsize=(12, 7))
plt.gcf().tight_layout(pad=10.0)
sns.barplot(
data=region_elderly,
x="elderly_population_ratio",
y="province",
order=region_elderly_order.index,
ax=ax_elderly,
)
sns.barplot(
data=nursing_home_by_province,
x="nursing_home_count",
y=nursing_home_by_province.index,
order=region_elderly_order.index,
ax=ax_nursing,
)
plt.show()
Based on this overview, we want to keep an eye on the following regions, due to their at-risk population:
High population in nursing homes (absolute numbers):
High % of elderly population
By crossing information from both graphs, we can also hypothesize that Seoul and Gyeonggi-do also have the highest population density in the country, because:
Let's verify this by crossing a couple of additional data points to get a better picture:
province_analysis["infrastructure_buildings"] = (
province_analysis["elementary_school_count"]
+ province_analysis["kindergarten_count"]
+ province_analysis["university_count"]
+ province_analysis["nursing_home_count"]
)
population_estimate = (
province_analysis[["province", "infrastructure_buildings"]]
.groupby("province")
.sum()
)
sns.barplot(
data=population_estimate,
y=population_estimate.index,
x=population_estimate["infrastructure_buildings"],
)
plt.title("Infrastructure Buildings by province")
Text(0.5, 1.0, 'Infrastructure Buildings by province')
Even though we cannot calculate the population density for each province, we can get an approximate idea of the areas with more public buildings, which can be an indicator of the overall population distribution
We can tentatively confirm our previous hypothesis for Seoul and Gyeonggi-do, since the trend also extends to other buildings, not just nursing homes.
TimeProvince Dataset¶Continuing from our previous visualization, let's try to get an understanding of how each province has evolved during the first months of this pandemic
cases_by_province = df_timeProvince.groupby(["province", "date"]).sum(numeric_only=True)
cases_by_province = cases_by_province.reset_index().melt(
id_vars=["province", "date"],
value_vars=["confirmed", "released", "deceased"],
var_name="metric",
value_name="cases",
)
cases_by_province.head()
| province | date | metric | cases | |
|---|---|---|---|---|
| 0 | Busan | 2020-01-20 | confirmed | 0 |
| 1 | Busan | 2020-01-21 | confirmed | 0 |
| 2 | Busan | 2020-01-22 | confirmed | 0 |
| 3 | Busan | 2020-01-23 | confirmed | 0 |
| 4 | Busan | 2020-01-24 | confirmed | 0 |
# We can use a variation of a standard RAG color coding:
Amber = "#ffa000"
covid_palette = {
"deceased": "Red",
"confirmed": Amber,
"released": "Green",
"negative": "Black",
"test": "Gray",
}
g = sns.FacetGrid(cases_by_province, col="province", col_wrap=5)
g.map(sns.lineplot, "date", "cases", "metric", palette=covid_palette)
g.set_xticklabels("")
g.add_legend()
<seaborn.axisgrid.FacetGrid at 0x7f0260458400>
Even with such little data, we can already confirm a few things:
Let's drill down and compare numbers of confirmed cases across the most impacted regions
incrementalCasesPerDay = cases_by_province[
["date", "metric", "province", "cases"]
].copy()
incrementalCasesPerDay["province"] = incrementalCasesPerDay["province"].astype(str)
incrementalCasesPerDay = incrementalCasesPerDay[
incrementalCasesPerDay["province"].isin(
["Daegu", "Gyeonggi-do", "Gyeongsangbuk-do", "Seoul"]
)
]
incrementalCasesPerDay = incrementalCasesPerDay.set_index(
["date", "metric", "province"]
)
incrementalCasesPerDay = (
incrementalCasesPerDay.unstack() # unstack metric to columns
.unstack() # unstack province to columns
.diff() # calculate diff with previous row
.fillna(0) # set to 0 for first row
.stack() # stack province back unto rows
.stack() # stack metric back unto rows
)
incrementalCasesPerDay = incrementalCasesPerDay.rename(columns={"cases": "new_cases"})
incrementalCasesPerDay[incrementalCasesPerDay["new_cases"] < 0]
| new_cases | |||
|---|---|---|---|
| date | metric | province | |
| 2020-03-30 | released | Seoul | -1 |
| 2020-04-17 | released | Gyeongsangbuk-do | -7 |
| 2020-05-13 | released | Daegu | -9 |
| 2020-06-27 | deceased | Gyeonggi-do | -1 |
It seems there are some minor cases where a day's values were adjusted downwards. Nothing major: we won't change those or reset them to 0, and we'll take them at face value.
These things can happen in a global pandemic. Everyone knows that.
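The cumulative-to-incremental conversion used above (unstack to wide, diff, stack back) can be illustrated on a toy frame:

```python
import pandas as pd

# Cumulative counts for two provinces across two days; unstacking province
# to columns makes diff() compare consecutive days within each province,
# instead of comparing adjacent rows of the long format
cum = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "province": ["A", "B", "A", "B"],
    "cases": [1, 10, 4, 12],
}).set_index(["date", "province"])

inc = cum.unstack().diff().fillna(0).stack()
```

A negative value in `inc` would mean the cumulative total went down between days, i.e. a retroactive correction like the ones in the table above.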
Let's plot these 4 provinces so we can zoom and get a better understanding
g = sns.FacetGrid(incrementalCasesPerDay.reset_index(), col="province")
g.map(sns.lineplot, "date", "new_cases", "metric", palette=covid_palette)
charts.rotate_x_labels(g)
g.add_legend()
<seaborn.axisgrid.FacetGrid at 0x7f02a111d6d0>
daegu_cases = incrementalCasesPerDay.reset_index()
daegu_cases = daegu_cases[daegu_cases["province"] == "Daegu"]
sns.lineplot(daegu_cases, x="date", y="new_cases", hue="metric", palette=covid_palette)
plt.title("Daegu")
Text(0.5, 1.0, 'Daegu')
top4_minus_Daegu = incrementalCasesPerDay.reset_index()
top4_minus_Daegu = top4_minus_Daegu[top4_minus_Daegu["province"] != "Daegu"]
g = sns.FacetGrid(top4_minus_Daegu, col="province", col_wrap=3)
g.map(sns.lineplot, "date", "new_cases", "metric", palette=covid_palette)
charts.rotate_x_labels(g)
TimeGender Dataset¶Let's try to do the same analysis, but by gender instead of province to see if we can spot any major differences.
We're expecting to find no significant differences.
inc_gender = df_timeGender.set_index(["date", "sex"]).unstack()
inc_gender = inc_gender.diff()[
1:
] # drop first row. we cannot compare it with the previous value
inc_gender
| confirmed | deceased | |||
|---|---|---|---|---|
| sex | female | male | female | male |
| date | ||||
| 2020-03-03 | 381 | 219 | 3 | 3 |
| 2020-03-04 | 330 | 186 | 0 | 4 |
| 2020-03-05 | 285 | 153 | 2 | 1 |
| 2020-03-06 | 322 | 196 | 3 | 4 |
| 2020-03-07 | 306 | 177 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 2020-06-26 | 15 | 24 | 0 | 0 |
| 2020-06-27 | 23 | 28 | 0 | 0 |
| 2020-06-28 | 24 | 38 | 0 | 0 |
| 2020-06-29 | 22 | 20 | 0 | 0 |
| 2020-06-30 | 18 | 25 | 0 | 0 |
120 rows × 4 columns
sns.lineplot(inc_gender["confirmed"], sizes=[5] * 120)
charts.rotate_x_labels()
sns.lineplot(inc_gender["deceased"], sizes=[5] * 120)
charts.rotate_x_labels()
From this chart we can extract a few interesting insights:
Each number and peak in this graph represents a tragedy and the loss of a human life, and we don't want to minimize any of it. Still, it's important to keep in mind that the population of South Korea is over 51M people, and that these numbers showcase an outstanding performance in terms of damage control.
If anything, the analysis of this dataset reinforces that we are looking at a true gem of pandemic management when it comes to saving lives.
TimeAge Dataset¶We can also take a quick glance at the impact distribution across age groups
df_timeAge
| date | age | confirmed | deceased | |
|---|---|---|---|---|
| 0 | 2020-03-02 | 0s | 32 | 0 |
| 1 | 2020-03-02 | 10s | 169 | 0 |
| 2 | 2020-03-02 | 20s | 1235 | 0 |
| 3 | 2020-03-02 | 30s | 506 | 1 |
| 4 | 2020-03-02 | 40s | 633 | 1 |
| ... | ... | ... | ... | ... |
| 1084 | 2020-06-30 | 40s | 1681 | 3 |
| 1085 | 2020-06-30 | 50s | 2286 | 15 |
| 1086 | 2020-06-30 | 60s | 1668 | 41 |
| 1087 | 2020-06-30 | 70s | 850 | 82 |
| 1088 | 2020-06-30 | 80s | 556 | 139 |
1089 rows × 4 columns
daily_timeAge = (
df_timeAge[["date", "age", "confirmed", "deceased"]]
.set_index(["date", "age"])
.unstack()
.copy()
)
inc_daily_timeAge = daily_timeAge.diff()
inc_daily_timeAge = inc_daily_timeAge.stack().swaplevel().sort_index()
inc_daily_timeAge
| confirmed | deceased | ||
|---|---|---|---|
| age | date | ||
| 0s | 2020-03-03 | 2 | 0 |
| 2020-03-04 | 0 | 0 | |
| 2020-03-05 | 4 | 0 | |
| 2020-03-06 | 7 | 0 | |
| 2020-03-07 | 7 | 0 | |
| ... | ... | ... | ... |
| 80s | 2020-06-26 | 2 | 0 |
| 2020-06-27 | 2 | 0 | |
| 2020-06-28 | 1 | 0 | |
| 2020-06-29 | 0 | 0 | |
| 2020-06-30 | 0 | 0 |
1080 rows × 2 columns
def draw_public_holidays(y, **kw):
# plt.axhline(y=y.max(), color="r", dashes=(2, 1), linewidth=0.4)
draw_holiday(2020, 3, 1, "Samiljeol", 1)
draw_holiday(2020, 5, 5, "Eorininal", 2, "right")
draw_holiday(2020, 5, 7, "Bucheonnim Osinnal", 3)
draw_holiday(2020, 6, 6, "Hyeonchung-il", 4)
def draw_holiday(
y: int, m: int, d: int, name: str, i: int, text_align: str = "left"
) -> None:
plt.axvline(x=u.epoch_for(y, m, d), color="gray", dashes=(1, 10), linewidth=1)
plt.gca().annotate(
name,
(u.epoch_for(y, m, d), 170),
color="gray",
weight="ultralight",
fontsize=9,
ha=text_align,
va="top",
rotation=90,
)
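u.epoch_for converts a calendar date into an x coordinate for the plot. Assuming it returns matplotlib's float date number (the coordinate system matplotlib uses for datetime axes), it might look like:

```python
import matplotlib.dates as mdates
from datetime import datetime


def epoch_for(year: int, month: int, day: int) -> float:
    # Matplotlib stores datetime axes as floats (days since its epoch),
    # so this value can be passed directly to axvline / annotate
    return mdates.date2num(datetime(year, month, day))
```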
# sources:
# https://en.wikipedia.org/wiki/Public_holidays_in_South_Korea
# https://en.wikipedia.org/wiki/Buddha%27s_Birthday
g = sns.relplot(
data=inc_daily_timeAge.reset_index(),
col="age",
kind="line",
x="date",
y="confirmed",
col_wrap=3,
)
g = g.map(draw_public_holidays, "confirmed")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Confirmed cases per age group")
charts.rotate_x_labels(g)
g = sns.relplot(
data=inc_daily_timeAge.reset_index(),
col="age",
kind="line",
x="date",
y="deceased",
col_wrap=5,
)
g = g.map(
lambda y, **kw: plt.axhline(y=y.median(), color="w", dashes=(2, 1), linewidth=0.4),
"deceased",
)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Deceased cases per age group")
charts.rotate_x_labels(g)
We can extract a few conclusions from this initial peek:
Time Dataset¶Let's take a look at the country-wide timeline to see if any obvious facts emerge from it.
time_cases_per_type = df_time.melt(
id_vars=["date"], var_name="type", value_name="cases"
)
time_cases_per_type.head()
| date | type | cases | |
|---|---|---|---|
| 0 | 2020-01-20 | test | 1 |
| 1 | 2020-01-21 | test | 1 |
| 2 | 2020-01-22 | test | 4 |
| 3 | 2020-01-23 | test | 22 |
| 4 | 2020-01-24 | test | 27 |
sns.lineplot(
data=time_cases_per_type, x="date", y="cases", hue="type", palette=covid_palette
)
plt.gca().add_patch(
patches.Rectangle(
(u.epoch_for(2020, 1, 30), -20000),
150,
80000,
edgecolor="darkblue",
facecolor="#f1f3ff",
fill=True,
lw=0.5,
)
)
plt.ylabel("cases (in millions)")
plt.title("South Korea - Covid-19 - Country-wide Timeline")
Text(0.5, 1.0, 'South Korea - Covid-19 - Country-wide Timeline')
South Korea managed to test over a million people during the first few months of the pandemic, and the vast majority of them were confirmed negative cases.
Let's zoom into the bottom of the chart so we can see the confirmed/released/deceased cases a bit better
noneg_notest = time_cases_per_type[
~time_cases_per_type["type"].isin(["test", "negative"])
]
sns.lineplot(data=noneg_notest, x="date", y="cases", hue="type", palette=covid_palette)
plt.ylabel("cases")
plt.title("South Korea - Covid-19 - Cumulative Country-wide Timeline")
Text(0.5, 1.0, 'South Korea - Covid-19 - Cumulative Country-wide Timeline')
South Korea as a role model for this pandemic
There are various elements that can bring us to this conclusion:
In future analysis we will try to find out the patterns that brought them to these much-better-than-average numbers, so we can implement a similar policy in our country
Weather dataset¶We suspect that the searchterms dataframe will have a seasonal nature to its patterns, let's visualize Korea's natural temperature cycles to get a better understanding.
Since South Korea has a temperate climate with a wide range of temperatures, we will use a rolling-average to smooth out the temperatures, so we can see the general trend instead of the precise values.
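A quick illustration of the centered rolling mean we are about to apply: each point is averaged with its neighbours inside the window, and the window is aligned on its middle element.

```python
import pandas as pd

s = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0])
smoothed = s.rolling(3, center=True).mean()
# The ends stay NaN because a full 3-element window doesn't fit there
```

With `center=False` (the default) the smoothed curve would lag behind the raw data by half a window, which is why we center it here.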
temperatures = df_weather[
["province", "date", "min_temp", "avg_temp", "max_temp"]
].rename(columns={"min_temp": "min", "avg_temp": "avg", "max_temp": "max"})
korea_temperatures = temperatures[["province", "date", "avg"]]
temps_rolling = (
korea_temperatures.set_index(["date", "province"])
.unstack("province")
.rolling(30, center=True)
.mean()
.stack("province")
)
province_palette = {p: "#b7b7b7" for p in korea_temperatures["province"].unique()}
province_width = {p: 0.1 for p in korea_temperatures["province"].unique()}
province_palette["Seoul"] = "#ff7c00"
province_width["Seoul"] = 1.5
plt.figure(figsize=(10, 6))
sns.lineplot(
data=temps_rolling.reset_index(),
x="date",
y="avg",
hue="province",
palette=province_palette,
size="province",
sizes=province_width,
legend=False,
)
plt.title("Avg Temperature (Seoul vs other provinces)")
Text(0.5, 1.0, 'Avg Temperature (Seoul vs other provinces)')
It seems that Seoul has more extreme temperatures than the rest of the country, which makes it a good candidate for finding seasonal changes (winter/summer).
Let's drill down one more level and see the temperature range
seoul_temperatures = temperatures[temperatures["province"] == "Seoul"].drop(
columns="province"
)
temps_palette = {
"min": sns.color_palette()[0],
"max": sns.color_palette()[1],
"avg": sns.color_palette()[2],
}
rolling_window_size = 21
seoul_rolling = seoul_temperatures.rolling(
rolling_window_size, on="date", center=True
).mean()
seoul_rolling = seoul_rolling.melt(
id_vars=["date"], var_name="measure", value_name="temperature"
)
plt.figure(figsize=(20, 5))
sns.lineplot(
data=seoul_rolling,
x="date",
hue="measure",
y="temperature",
linewidth=0.3,
palette=temps_palette,
hue_order=["max", "avg", "min"],
)
plt.axhline(y=0, color="0", linewidth=0.4)
plt.title(f"Temperature in Seoul (rolling avg. over {rolling_window_size} days)")
Text(0.5, 1.0, 'Temperature in Seoul (rolling avg. over 21 days)')
After seeing this, we could use the months with the lowest temperatures as a "winter period" when analysing the other datasets
Now we're ready to compare these temperature changes with search trends
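One hedged way to operationalise that "winter period" would be to flag the months whose mean temperature falls below a threshold. The threshold and the monthly values below are illustrative, not taken from the dataset:

```python
import pandas as pd

# toy monthly averages (°C) standing in for the Seoul rolling series
monthly_avg = pd.Series(
    [-2.5, 0.1, 6.0, 12.5, 18.0, 23.0, 26.0, 27.0, 21.5, 14.5, 7.0, -0.5],
    index=range(1, 13),  # month number
)
winter_months = monthly_avg[monthly_avg < 2.0].index.tolist()
print(winter_months)  # → [1, 2, 12]
```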
SearchTrend Dataset
searchterms_palette = {
"cold": sns.color_palette()[0],
"flu": sns.color_palette()[1],
"pneumonia": sns.color_palette()[2],
"coronavirus": sns.color_palette()[3],
}
dailyhits = df_searchTrend.melt(id_vars=["date"], var_name="term", value_name="hits")
g = sns.relplot(
data=dailyhits.reset_index(),
col="term",
kind="line",
x="date",
y="hits",
col_wrap=4,
hue="term",
palette=searchterms_palette,
)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Trending searches")
charts.rotate_x_labels(g)
Let's zoom into the lower part of the chart so we can see patterns or trends a bit more closely.
We'll also want to paint "winter" periods so we can more easily detect any seasonal behaviours related to colds/flu
upper_limit = 3.5
lower_limit = -0.05
# we want to highlight that covid searches are 0 for
# most of the years and then shoot up in 2020
def paint_winters(series, **kw):
    for year in range(2016, 2021):
        paint_winter(u.epoch_for(year, 1, 1))
def paint_winter(day):
plt.gca().add_patch(
patches.Rectangle(
(day - 30, lower_limit),
90,
upper_limit - lower_limit,
edgecolor="darkblue",
facecolor="#f8f8f8",
fill=True,
lw=0.5,
)
)
plt.gca().annotate(
"winter",
(day + 15, upper_limit * 0.99),
color="black",
weight="ultralight",
fontsize=9,
ha="center",
va="top",
)
g = sns.FacetGrid(
dailyhits.reset_index(),
col="term",
col_wrap=2,
sharey=False,
height=3,
aspect=3,
ylim=(lower_limit, upper_limit),
xlim=(u.epoch_for(2016, 1, 1), u.epoch_for(2020, 7, 30)),
)
g.map(sns.lineplot, "date", "hits", "term", palette=searchterms_palette)
g = g.map(paint_winters, "hits")
charts.rotate_x_labels(g, 90, 2)
In this quick visualization we can spot some interesting patterns:
If we decide to use search terms for future analysis, we will likely need to smartly merge COVID + PNEUMONIA hits (for the year 2020), as the spike in PNEUMONIA search hits is likely a stand-in for CORONAVIRUS from before that term became widely known
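If we go that route, the merge itself could be a simple groupby sum; a sketch on toy data (the column names mirror the melted `dailyhits` frame, the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01"] * 2 + ["2020-02-02"] * 2),
    "term": ["coronavirus", "pneumonia"] * 2,
    "hits": [10.0, 4.0, 12.0, 5.0],
})
# collapse the two terms into a single combined signal per day
merged = (
    toy[toy["term"].isin(["coronavirus", "pneumonia"])]
    .groupby("date")["hits"]
    .sum()
)
print(merged.loc[pd.Timestamp("2020-02-01")])  # → 14.0
```

A plain sum double-counts days where both terms genuinely trended, so a real merge might instead take the max, or only substitute PNEUMONIA hits before a cutoff date.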
Policy dataset
Something else we might be able to use for our timeline analysis is the Policy Dataset.
This dataset contains Korean policies enacted during the initial months of the COVID pandemic.
The hope is that we can find one or more policies that helped curb the growth in confirmed cases seen in the previous trends.
policies = df_policy.drop(columns=["policy_id", "country"])
policies["end_date"] = policies["end_date"].fillna(dataset_creation_date)
policies["policy_detail"] = (
policies["gov_policy"].str.slice(0, 30)
+ " - "
+ policies["detail"].str.slice(0, 30)
)
policies.head()
| type | gov_policy | detail | start_date | end_date | policy_detail | |
|---|---|---|---|---|---|---|
| 0 | Alert | Infectious Disease Alert Level | Level 1 (Blue) | 2020-01-03 | 2020-01-19 | Infectious Disease Alert Level - Level 1 (Blue) |
| 1 | Alert | Infectious Disease Alert Level | Level 2 (Yellow) | 2020-01-20 | 2020-01-27 | Infectious Disease Alert Level - Level 2 (Yellow) |
| 2 | Alert | Infectious Disease Alert Level | Level 3 (Orange) | 2020-01-28 | 2020-02-22 | Infectious Disease Alert Level - Level 3 (Orange) |
| 3 | Alert | Infectious Disease Alert Level | Level 4 (Red) | 2020-02-23 | 2020-05-31 | Infectious Disease Alert Level - Level 4 (Red) |
| 4 | Immigration | Special Immigration Procedure | from China | 2020-02-04 | 2020-05-31 | Special Immigration Procedure - from China |
default_palette = px.colors.qualitative.Plotly
policy_palette = {
"Alert": "red",
"Immigration": default_palette[1],
"Health": default_palette[2],
"Social": default_palette[3],
"Education": default_palette[5],
"Administrative": default_palette[7],
"Technology": "#9a9a9a",
"Transformation": "#54a24b",
}
fig = px.timeline(
policies,
x_start="start_date",
x_end="end_date",
y="policy_detail",
opacity=0.5,
facet_col_spacing=1,
color="type",
height=1500,
text="detail",
color_discrete_map=policy_palette,
)
fig.show()
Now that we have a general idea of the datasets we have, we can start looking into the data to find patterns, and see if our expectations hold true.
Given that we will not be running statistical experiments, we will not call them "Hypotheses"; we will call them "Expectations", and we will try to validate them with the data we have.
Here's a list of general insights we expect to find in the dataset provided.
Expectation:
Strong social distancing policies helped reduce the number of infections
Let's take a look at how the timelines of social distancing policies line up with the number of cases, first by age group and then by province.
df_social_distancing = policies[policies["gov_policy"] == "Social Distancing Campaign"]
incubation_days_from = 2
incubation_days_until = 14
incubation_period = incubation_days_from
def draw_policies(y, policy_height=170, **kw):
    # plt.axhline(y=y.max(), color="r", dashes=(2, 1), linewidth=0.4)
    for i in range(4):
        draw_policy(df_social_distancing.iloc[i], policy_height)
def draw_policy(policy, policy_height: int) -> None:
start = policy["start_date"]
start_date = u.epoch_for(start.year, start.month, start.day) + incubation_period
duration = (policy["end_date"] - policy["start_date"]) / np.timedelta64(1, "D")
color = "#f69e87" if policy["detail"] == "Strong" else "#fddb99"
plt.gca().add_patch(
patches.Rectangle(
(start_date, 0),
duration,
policy_height + 5,
facecolor=color,
fill=True,
lw=0.5,
)
)
plt.gca().annotate(
policy["detail"],
(start_date + 2, policy_height),
color="white",
fontsize=10,
va="top",
rotation=90,
)
def draw_policy_custom_height(policy_height: int):
    def curried(y, **kw):
        draw_policies(y, policy_height, **kw)
    return curried
g = sns.relplot(
data=inc_daily_timeAge.reset_index(),
col="age",
kind="line",
x="date",
y="confirmed",
col_wrap=3,
)
g = g.map(draw_policies, "confirmed")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Confirmed cases per age group")
charts.rotate_x_labels(g)
There is some indication that social distancing policies contributed to a reduction in confirmed cases across all age groups; the decrease is especially notable during the "Strong" social distancing periods.
Let's do the same analysis by province, just to be extra sure.
We expect the same downward trend in most if not all of them, ideally, with a strong decline in the top 4 provinces we identified earlier.
daily_timeProvince = (
df_timeProvince[["date", "province", "confirmed", "deceased"]]
.set_index(["date", "province"])
.unstack()
.copy()
)
inc_daily_timeProvince = daily_timeProvince.diff()
inc_daily_timeProvince = inc_daily_timeProvince.stack().swaplevel().sort_index()
inc_daily_timeProvince.head(5)
| confirmed | deceased | ||
|---|---|---|---|
| province | date | ||
| Busan | 2020-01-21 | 0 | 0 |
| 2020-01-22 | 0 | 0 | |
| 2020-01-23 | 0 | 0 | |
| 2020-01-24 | 0 | 0 | |
| 2020-01-25 | 0 | 0 |
g = sns.relplot(
data=inc_daily_timeProvince.reset_index(),
col="province",
kind="line",
x="date",
y="confirmed",
col_wrap=4,
)
g = g.map(draw_policy_custom_height(600), "confirmed")
g.fig.subplots_adjust(top=0.95)
g.fig.suptitle("Confirmed cases per province")
charts.rotate_x_labels(g)
Just as we expected, the policy periods overlap with the sharp decline across all the provinces, and most importantly with the top 4 impacted regions we identified earlier.
Conclusion: There seems to be a reduction in cases during the periods where Strong social distancing policies were in place, both when looking at provinces and when looking at age ranges.
This is, however, not enough to make categorical claims, as there could have been other factors at play (non-government policies, workplace/social changes, etc.). We can keep it as a possible contributing factor, since the current data does not allow us to discard it.
Expectation:
The elderly population (people aged 65 and above) is most at risk of serious illness or death if they contract COVID-19
We don't have any data regarding "serious illness", so we can only track deceased statistics, but we have reason to believe that the same patterns would emerge if we had data on serious illness or health complications.
At the very least, this line of exploration gives us a "best-case scenario": the least terrible picture possible.
Let's divide the population into clusters, depending on their age:
def age_group_mapper(age):
if age in ["70s", "80s"]:
return age
elif age in ["30s", "40s", "50s", "60s"]:
return "middle-aged people"
elif age in ["0s", "10s", "20s"]:
return "young people"
else:
return "unknown age group"
df_timeAge["age_group"] = df_timeAge["age"].map(age_group_mapper)
df_timeAge.head(10)
| date | age | confirmed | deceased | age_group | |
|---|---|---|---|---|---|
| 0 | 2020-03-02 | 0s | 32 | 0 | young people |
| 1 | 2020-03-02 | 10s | 169 | 0 | young people |
| 2 | 2020-03-02 | 20s | 1235 | 0 | young people |
| 3 | 2020-03-02 | 30s | 506 | 1 | middle-aged people |
| 4 | 2020-03-02 | 40s | 633 | 1 | middle-aged people |
| 5 | 2020-03-02 | 50s | 834 | 5 | middle-aged people |
| 6 | 2020-03-02 | 60s | 530 | 6 | middle-aged people |
| 7 | 2020-03-02 | 70s | 192 | 6 | 70s |
| 8 | 2020-03-02 | 80s | 81 | 3 | 80s |
| 9 | 2020-03-03 | 0s | 34 | 0 | young people |
df_timeAge["deceased_ratio"] = 100 * df_timeAge["deceased"] / df_timeAge["confirmed"]
df_timeAge.head(10)
| date | age | confirmed | deceased | age_group | deceased_ratio | |
|---|---|---|---|---|---|---|
| 0 | 2020-03-02 | 0s | 32 | 0 | young people | 0.0 |
| 1 | 2020-03-02 | 10s | 169 | 0 | young people | 0.0 |
| 2 | 2020-03-02 | 20s | 1235 | 0 | young people | 0.0 |
| 3 | 2020-03-02 | 30s | 506 | 1 | middle-aged people | 0.197628 |
| 4 | 2020-03-02 | 40s | 633 | 1 | middle-aged people | 0.157978 |
| 5 | 2020-03-02 | 50s | 834 | 5 | middle-aged people | 0.59952 |
| 6 | 2020-03-02 | 60s | 530 | 6 | middle-aged people | 1.132075 |
| 7 | 2020-03-02 | 70s | 192 | 6 | 70s | 3.125 |
| 8 | 2020-03-02 | 80s | 81 | 3 | 80s | 3.703704 |
| 9 | 2020-03-03 | 0s | 34 | 0 | young people | 0.0 |
sns.lineplot(
data=df_timeAge, x="date", y="deceased_ratio", hue="age_group", errorbar=None
)
plt.ylabel("deceased ratio (%)")
charts.rotate_x_labels()
Conclusion: There is a clear trend, with the 70s and 80s groups separating from the rest. We cannot discard age as a factor related to the severity of symptoms.
As we pointed out earlier, this graph only tracks mortality, which means the full picture (deceased + serious illness + life-changing complications) will be worse than what we can chart here.
For this analysis we want to review case data to see if we can trace infections through the population and detect which population segment is the largest spreader.
Seeing how the 20-29s are the group with the most confirmed infections, we suspect they might be the ones helping the virus spread.
Let's look at direct infection cases to confirm/reject this expectation.
Locating the victims of infections
patient_contact = (
df_patientInfo[df_patientInfo["infection_case"] == "contact with patient"]
.reset_index()
.copy()
)
patient_contact.head()
| patient_id | sex | age | country | province | city | infection_case | infected_by | confirmed_date | released_date | deceased_date | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000000003 | male | 50s | Korea | Seoul | Jongno-gu | contact with patient | 2002000001 | 2020-01-30 | 2020-02-19 | NaT | released |
| 1 | 1000000005 | female | 20s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000002 | 2020-01-31 | 2020-02-24 | NaT | released |
| 2 | 1000000006 | female | 50s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-01-31 | 2020-02-19 | NaT | released |
| 3 | 1000000007 | male | 20s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-01-31 | 2020-02-10 | NaT | released |
| 4 | 1000000010 | female | 60s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000003 | 2020-02-05 | 2020-02-29 | NaT | released |
Locating the infectors
infector_lookup = df_patientInfo.reset_index()[
["patient_id", "age", "sex", "province", "city"]
].copy()
infector_lookup["patient_id"] = infector_lookup["patient_id"].astype(str)
infector_lookup = infector_lookup.add_prefix("infector_")
infector_lookup["infector_age_group"] = infector_lookup["infector_age"].map(
age_group_mapper
)
infector_lookup.head()
| infector_patient_id | infector_age | infector_sex | infector_province | infector_city | infector_age_group | |
|---|---|---|---|---|---|---|
| 0 | 1000000001 | 50s | male | Seoul | Gangseo-gu | middle-aged people |
| 1 | 1000000002 | 30s | male | Seoul | Jungnang-gu | middle-aged people |
| 2 | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 3 | 1000000004 | 20s | male | Seoul | Mapo-gu | young people |
| 4 | 1000000005 | 20s | female | Seoul | Seongbuk-gu | young people |
Merging the data
infection_depth_1 = patient_contact.merge(
infector_lookup, left_on="infected_by", right_on="infector_patient_id"
)
infection_depth_1
| patient_id | sex | age | country | province | city | infection_case | infected_by | confirmed_date | released_date | deceased_date | state | infector_patient_id | infector_age | infector_sex | infector_province | infector_city | infector_age_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000000005 | female | 20s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000002 | 2020-01-31 | 2020-02-24 | NaT | released | 1000000002 | 30s | male | Seoul | Jungnang-gu | middle-aged people |
| 1 | 1000000006 | female | 50s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-01-31 | 2020-02-19 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 2 | 1000000007 | male | 20s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-01-31 | 2020-02-10 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 3 | 1000000010 | female | 60s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000003 | 2020-02-05 | 2020-02-29 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 4 | 1000000017 | male | 70s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-02-20 | 2020-03-01 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1225 | 6100000082 | male | 50s | Korea | Gyeongsangnam-do | Geochang-gun | contact with patient | 6100000068 | 2020-03-07 | 2020-03-19 | NaT | released | 6100000068 | 60s | female | Gyeongsangnam-do | Geochang-gun | middle-aged people |
| 1226 | 6100000079 | male | 20s | Korea | Gyeongsangnam-do | Changnyeong-gun | contact with patient | 6100000076 | 2020-03-06 | NaT | NaT | released | 6100000076 | 20s | male | Gyeongsangnam-do | Changnyeong-gun | young people |
| 1227 | 6100000111 | male | 20s | Korea | Gyeongsangnam-do | Sacheon-si | contact with patient | 6100000108 | 2020-04-06 | NaT | NaT | released | 6100000108 | 10s | male | Gyeongsangnam-do | Sacheon-si | young people |
| 1228 | 6100000112 | male | 60s | Korea | Gyeongsangnam-do | Hapcheon-gun | contact with patient | 6100000100 | 2020-04-07 | NaT | NaT | released | 6100000100 | 60s | female | Gyeongsangnam-do | Jinju-si | middle-aged people |
| 1229 | 7000000011 | male | 30s | Korea | Jeju-do | Jeju-do | contact with patient | 7000000009 | 2020-04-03 | 2020-05-19 | NaT | released | 7000000009 | 20s | female | Jeju-do | Jeju-do | young people |
1230 rows × 18 columns
sns.countplot(data=infection_depth_1, x="infector_age")
charts.rotate_x_labels()
It seems our initial expectation was misguided. While we cannot confirm this with certainty due to the small number of datapoints, we can likely reject the idea that the young (20s or less) are doing most of the spreading. The chart above indicates that we were off by an entire generation.
This could be for various reasons:
Let's visualize this over 2 dimensions to see the infector/infected vectors (in case there are any hidden patterns)
infection_count = pd.DataFrame(
infection_depth_1[["age", "infector_age"]].value_counts()
).reset_index()
infection_count = infection_count.pivot(index="age", columns="infector_age", values=0)
infection_count = infection_count.drop(columns=["unknown"], index=["unknown"])
infection_count = infection_count.sort_index().sort_index(axis=1)
infection_count
| infector_age | 0s | 10s | 20s | 30s | 40s | 50s | 60s | 70s | 80s | 90s |
|---|---|---|---|---|---|---|---|---|---|---|
| age | ||||||||||
| 0s | 1.0 | NaN | NaN | 15.0 | 8.0 | NaN | NaN | NaN | NaN | NaN |
| 10s | NaN | 16.0 | 9.0 | 4.0 | 19.0 | 3.0 | 2.0 | 3.0 | NaN | NaN |
| 20s | NaN | 6.0 | 38.0 | 13.0 | 17.0 | 24.0 | 4.0 | 14.0 | 3.0 | NaN |
| 30s | 1.0 | 2.0 | 9.0 | 38.0 | 21.0 | 10.0 | 10.0 | 9.0 | 4.0 | NaN |
| 40s | 1.0 | 5.0 | 7.0 | 12.0 | 63.0 | 27.0 | 6.0 | 9.0 | 2.0 | NaN |
| 50s | 1.0 | 7.0 | 29.0 | 11.0 | 29.0 | 49.0 | 23.0 | 25.0 | 11.0 | 1.0 |
| 60s | NaN | NaN | 5.0 | 17.0 | 12.0 | 15.0 | 47.0 | 21.0 | 5.0 | 1.0 |
| 70s | NaN | 1.0 | 3.0 | 2.0 | 4.0 | 8.0 | 11.0 | 10.0 | 15.0 | NaN |
| 80s | NaN | NaN | 3.0 | NaN | 2.0 | 2.0 | 6.0 | 7.0 | 11.0 | 1.0 |
| 90s | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | 7.0 | NaN |
ax = sns.heatmap(data=infection_count, cmap="magma_r", annot=True)
ax.set(xlabel="infector", ylabel="infected")
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")
No obvious patterns emerge other than the primary diagonal, which seems to indicate that most of the infections occur within the same age group.
Some age groups also have a secondary cross-group infection with groups 30 years apart:
We suspect this is due to Korea-specific demographics and represents the age gap between parents and children (infections within the same family).
The average age of mothers at first birth in South Korea is around 33.
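That parent/child hypothesis could be checked by turning the decade labels into numbers and looking at the infector/infected age gap. A rough sketch on invented rows (the parsing assumes decade labels like "20s", as in PatientInfo):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": ["20s", "50s", "60s"],
    "infector_age": ["50s", "50s", "30s"],
})

def decade(label: str) -> float:
    # "20s" -> 20.0; anything unparseable -> NaN
    try:
        return float(label.rstrip("s"))
    except (ValueError, AttributeError):
        return float("nan")

gap = toy["age"].map(decade) - toy["infector_age"].map(decade)
print(gap.abs().tolist())  # → [30.0, 0.0, 30.0]
```

A histogram of that gap over the real `infection_depth_1` frame would show whether the 30-year offset is a genuine secondary mode.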
infection_by_city = infection_depth_1[
["province", "city", "infector_province", "infector_city"]
].copy()
infection_by_city["infected"] = infection_by_city["province"].str.cat(
infection_by_city["city"], sep="/"
)
infection_by_city["infector"] = infection_by_city["infector_province"].str.cat(
infection_by_city["infector_city"], sep="/"
)
infection_by_city.head()
| province | city | infector_province | infector_city | infected | infector | |
|---|---|---|---|---|---|---|
| 0 | Seoul | Seongbuk-gu | Seoul | Jungnang-gu | Seoul/Seongbuk-gu | Seoul/Jungnang-gu |
| 1 | Seoul | Jongno-gu | Seoul | Jongno-gu | Seoul/Jongno-gu | Seoul/Jongno-gu |
| 2 | Seoul | Jongno-gu | Seoul | Jongno-gu | Seoul/Jongno-gu | Seoul/Jongno-gu |
| 3 | Seoul | Seongbuk-gu | Seoul | Jongno-gu | Seoul/Seongbuk-gu | Seoul/Jongno-gu |
| 4 | Seoul | Jongno-gu | Seoul | Jongno-gu | Seoul/Jongno-gu | Seoul/Jongno-gu |
Let's also trim down the dataset. Let's exclude intra-province infections and see if there are any hot trans-province trends.
infection_by_city_count = pd.DataFrame(infection_by_city.value_counts()).reset_index()
infection_by_city_count = infection_by_city_count[
infection_by_city_count["infected"] != infection_by_city_count["infector"]
]
infection_by_city_count.head()
| province | city | infector_province | infector_city | infected | infector | 0 | |
|---|---|---|---|---|---|---|---|
| 11 | Chungcheongnam-do | Cheonan-si | Chungcheongnam-do | Asan-si | Chungcheongnam-do/Cheonan-si | Chungcheongnam-do/Asan-si | 18 |
| 16 | Gyeonggi-do | Bucheon-si | Incheon | Bupyeong-gu | Gyeonggi-do/Bucheon-si | Incheon/Bupyeong-gu | 11 |
| 21 | Incheon | Michuhol-gu | Incheon | Bupyeong-gu | Incheon/Michuhol-gu | Incheon/Bupyeong-gu | 9 |
| 26 | Gyeonggi-do | Gwangju-si | Gyeonggi-do | Seongnam-si | Gyeonggi-do/Gwangju-si | Gyeonggi-do/Seongnam-si | 8 |
| 29 | Gyeonggi-do | Bucheon-si | Seoul | Nowon-gu | Gyeonggi-do/Bucheon-si | Seoul/Nowon-gu | 7 |
infection_by_city_count = infection_by_city_count.pivot(
index="infected", columns="infector", values=0
)
plt.figure(figsize=(10, 10))
ax = sns.heatmap(data=infection_by_city_count, cmap="magma_r")
ax.set(xlabel="infector", ylabel="infected")
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")
plt.title("trans-city infections")
charts.rotate_x_labels()
We can see some interesting insights:
infection_by_province_count = pd.DataFrame(
infection_by_city.drop(
columns=["infected", "infector", "city", "infector_city"]
).value_counts()
).reset_index()
infection_by_province_count.head()
| province | infector_province | 0 | |
|---|---|---|---|
| 0 | Gyeonggi-do | Gyeonggi-do | 486 |
| 1 | Incheon | Incheon | 145 |
| 2 | Gyeonggi-do | Seoul | 109 |
| 3 | Seoul | Seoul | 106 |
| 4 | Gyeongsangbuk-do | Gyeongsangbuk-do | 98 |
infection_by_province_count = infection_by_province_count.pivot(
index="province", columns="infector_province", values=0
)
infection_by_province_count = infection_by_province_count.sort_index().sort_index(
axis=1
)
infection_by_province_count
| infector_province | Busan | Chungcheongbuk-do | Chungcheongnam-do | Daegu | Daejeon | Gwangju | Gyeonggi-do | Gyeongsangbuk-do | Gyeongsangnam-do | Incheon | Jeju-do | Jeollabuk-do | Sejong | Seoul | Ulsan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| province | |||||||||||||||
| Busan | 25.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| Chungcheongbuk-do | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Chungcheongnam-do | NaN | NaN | 89.0 | NaN | 4.0 | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN |
| Daegu | NaN | NaN | NaN | 3.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Daejeon | NaN | NaN | 2.0 | 1.0 | 38.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gwangju | NaN | NaN | NaN | NaN | NaN | 16.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gyeonggi-do | 1.0 | NaN | NaN | 4.0 | 4.0 | NaN | 486.0 | NaN | NaN | 30.0 | NaN | NaN | NaN | 109.0 | NaN |
| Gyeongsangbuk-do | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | 98.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gyeongsangnam-do | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 22.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| Incheon | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 145.0 | NaN | NaN | NaN | NaN | NaN |
| Jeju-do | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| Jeollabuk-do | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN |
| Jeollanam-do | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
| Sejong | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN |
| Seoul | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 106.0 | 1.0 |
| Ulsan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 11.0 |
ax = sns.heatmap(data=infection_by_province_count, cmap="magma_r")
ax.set(xlabel="infector", ylabel="infected")
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")
plt.title("infection vectors by province")
charts.rotate_x_labels()
This chart shows the same conclusions as before, but at a lower resolution, which reduces noise and makes the patterns clearer.
We've included same-province infections so we can confirm the expectations we had when we did the trans-city analysis.
Let's try to visualize these hotspots over a real map, so we can get a better understanding.
Expectation:
Most of the infections (hotspots) begin around Seoul and adjacent cities.
We will need to visualize geographical data to make sense of this.
infection_depth_1.head()
| patient_id | sex | age | country | province | city | infection_case | infected_by | confirmed_date | released_date | deceased_date | state | infector_patient_id | infector_age | infector_sex | infector_province | infector_city | infector_age_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000000005 | female | 20s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000002 | 2020-01-31 | 2020-02-24 | NaT | released | 1000000002 | 30s | male | Seoul | Jungnang-gu | middle-aged people |
| 1 | 1000000006 | female | 50s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-01-31 | 2020-02-19 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 2 | 1000000007 | male | 20s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-01-31 | 2020-02-10 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 3 | 1000000010 | female | 60s | Korea | Seoul | Seongbuk-gu | contact with patient | 1000000003 | 2020-02-05 | 2020-02-29 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
| 4 | 1000000017 | male | 70s | Korea | Seoul | Jongno-gu | contact with patient | 1000000003 | 2020-02-20 | 2020-03-01 | NaT | released | 1000000003 | 50s | male | Seoul | Jongno-gu | middle-aged people |
lat_long_lookup = (
df_region[["province", "city", "latitude", "longitude"]]
.copy()
.set_index(["province", "city"])
)
lat_long_lookup
| latitude | longitude | ||
|---|---|---|---|
| province | city | ||
| Seoul | Seoul | 37.566953 | 126.977977 |
| Gangnam-gu | 37.518421 | 127.047222 | |
| Gangdong-gu | 37.530492 | 127.123837 | |
| Gangbuk-gu | 37.639938 | 127.025508 | |
| Gangseo-gu | 37.551166 | 126.849506 | |
| ... | ... | ... | ... |
| Gyeongsangnam-do | Haman-gun | 35.272481 | 128.40654 |
| Hamyang-gun | 35.520541 | 127.725177 | |
| Hapcheon-gun | 35.566702 | 128.16587 | |
| Jeju-do | Jeju-do | 33.488936 | 126.500423 |
| Korea | Korea | 37.566953 | 126.977977 |
244 rows × 2 columns
@lru_cache(maxsize=512)  # use memoization to speed up repeated lookups
def latlong_lookup(province: str, city: str, attribute: str) -> float:
    try:
        return lat_long_lookup.loc[(province, city), attribute]
    except KeyError:  # unknown (province, city) pair
        return np.nan
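Since the lookup is memoized, `functools` exposes counters we can inspect to confirm the cache is actually absorbing the repeated (province, city) pairs. A self-contained toy stands in for the DataFrame lookup here:

```python
from functools import lru_cache

@lru_cache(maxsize=512)
def lookup(key: str) -> int:
    return len(key)  # stand-in for the DataFrame .loc lookup

for _ in range(3):
    lookup("Seoul")
info = lookup.cache_info()
print(info.hits, info.misses)  # → 2 1
```

Calling `latlong_lookup.cache_info()` after the enrichment below would show a high hit rate, since the same few hundred city pairs repeat across ~1200 rows.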
infection_vectors = infection_depth_1[
["province", "city", "infector_province", "infector_city"]
].copy()
def enrich_with_coordinates(row):
row["infected_latitude"] = latlong_lookup(row["province"], row["city"], "latitude")
row["infected_longitude"] = latlong_lookup(
row["province"], row["city"], "longitude"
)
row["infector_latitude"] = latlong_lookup(
row["infector_province"], row["infector_city"], "latitude"
)
row["infector_longitude"] = latlong_lookup(
row["infector_province"], row["infector_city"], "longitude"
)
return row
# TODO remove slow operation
# OPTION 1 - this takes 2.5 sec
# infection_vectors = infection_vectors.apply(enrich_with_coordinates, axis=1)
# OPTION 2 - this takes 0.057 sec
infection_vectors["infected_latitude"] = infection_vectors.apply(
lambda row: latlong_lookup(row["province"], row["city"], "latitude"), axis=1
)
infection_vectors["infected_longitude"] = infection_vectors.apply(
lambda row: latlong_lookup(row["province"], row["city"], "longitude"), axis=1
)
infection_vectors["infector_latitude"] = infection_vectors.apply(
lambda row: latlong_lookup(
row["infector_province"], row["infector_city"], "latitude"
),
axis=1,
)
infection_vectors["infector_longitude"] = infection_vectors.apply(
lambda row: latlong_lookup(
row["infector_province"], row["infector_city"], "longitude"
),
axis=1,
)
infection_vectors = infection_vectors.dropna(axis=0)
infection_vectors
| province | city | infector_province | infector_city | infected_latitude | infected_longitude | infector_latitude | infector_longitude | |
|---|---|---|---|---|---|---|---|---|
| 0 | Seoul | Seongbuk-gu | Seoul | Jungnang-gu | 37.589562 | 127.016700 | 37.606832 | 127.092656 |
| 1 | Seoul | Jongno-gu | Seoul | Jongno-gu | 37.572999 | 126.979189 | 37.572999 | 126.979189 |
| 2 | Seoul | Jongno-gu | Seoul | Jongno-gu | 37.572999 | 126.979189 | 37.572999 | 126.979189 |
| 3 | Seoul | Seongbuk-gu | Seoul | Jongno-gu | 37.589562 | 127.016700 | 37.572999 | 126.979189 |
| 4 | Seoul | Jongno-gu | Seoul | Jongno-gu | 37.572999 | 126.979189 | 37.572999 | 126.979189 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1225 | Gyeongsangnam-do | Geochang-gun | Gyeongsangnam-do | Geochang-gun | 35.686526 | 127.910021 | 35.686526 | 127.910021 |
| 1226 | Gyeongsangnam-do | Changnyeong-gun | Gyeongsangnam-do | Changnyeong-gun | 35.544603 | 128.492330 | 35.544603 | 128.492330 |
| 1227 | Gyeongsangnam-do | Sacheon-si | Gyeongsangnam-do | Sacheon-si | 35.003668 | 128.064272 | 35.003668 | 128.064272 |
| 1228 | Gyeongsangnam-do | Hapcheon-gun | Gyeongsangnam-do | Jinju-si | 35.566702 | 128.165870 | 35.180313 | 128.108750 |
| 1229 | Jeju-do | Jeju-do | Jeju-do | Jeju-do | 33.488936 | 126.500423 | 33.488936 | 126.500423 |
1167 rows × 8 columns
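As an aside, the four row-wise `apply` calls above could likely be replaced by two vectorized merges on (province, city), avoiding Python-level loops entirely. A sketch on toy frames (the column names mirror the real ones, the coordinates are invented):

```python
import pandas as pd

coords = pd.DataFrame({
    "province": ["Seoul", "Seoul"],
    "city": ["Jongno-gu", "Seongbuk-gu"],
    "latitude": [37.573, 37.590],
    "longitude": [126.979, 127.017],
})
vectors = pd.DataFrame({
    "province": ["Seoul"], "city": ["Seongbuk-gu"],
    "infector_province": ["Seoul"], "infector_city": ["Jongno-gu"],
})
# merge once for the infected side, once for the infector side
enriched = vectors.merge(
    coords.rename(columns={"latitude": "infected_latitude",
                           "longitude": "infected_longitude"}),
    on=["province", "city"], how="left",
).merge(
    coords.rename(columns={"province": "infector_province",
                           "city": "infector_city",
                           "latitude": "infector_latitude",
                           "longitude": "infector_longitude"}),
    on=["infector_province", "infector_city"], how="left",
)
print(enriched.loc[0, "infector_latitude"])  # → 37.573
```

`how="left"` keeps rows with unknown cities (they get NaN coordinates), matching the behaviour of the cached lookup followed by `dropna`.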
animated_paths = True
map_lat_long = folium.Map(
location=[36.0859, 127.9468], zoom_start=6.5, control_scale=True, width=800
)
for index, row in infection_vectors.iterrows():
if animated_paths:
plugins.AntPath(
[
[row["infector_latitude"], row["infector_longitude"]],
[row["infected_latitude"], row["infected_longitude"]],
],
weight=3,
).add_to(map_lat_long)
else:
folium.PolyLine(
[
[row["infector_latitude"], row["infector_longitude"]],
[row["infected_latitude"], row["infected_longitude"]],
],
weight=2,
).add_to(map_lat_long)
folium_utils.figure(map_lat_long)
This graph does not really match what we expected! We remember seeing lots of cases around Gyeonggi-do, but the bulk of lines appear to be around Seoul. Let's find out why:
infection_vectors.value_counts().head(10)
province city infector_province infector_city infected_latitude infected_longitude infector_latitude infector_longitude
Gyeonggi-do Seongnam-si Gyeonggi-do Seongnam-si 37.420000 127.126703 37.420000 127.126703 98
Bucheon-si Gyeonggi-do Bucheon-si 37.503393 126.766049 37.503393 126.766049 62
Gunpo-si Gyeonggi-do Gunpo-si 37.361653 126.935206 37.361653 126.935206 48
Chungcheongnam-do Cheonan-si Chungcheongnam-do Cheonan-si 36.814980 127.113868 36.814980 127.113868 45
Gyeongsangbuk-do Yecheon-gun Gyeongsangbuk-do Yecheon-gun 36.646707 128.437435 36.646707 128.437435 35
Incheon Bupyeong-gu Incheon Bupyeong-gu 37.507031 126.721804 37.507031 126.721804 28
Gyeonggi-do Suwon-si Gyeonggi-do Suwon-si 37.263376 127.028613 37.263376 127.028613 28
Gyeongsangbuk-do Gyeongju-si Gyeongsangbuk-do Gyeongju-si 35.856185 129.224796 35.856185 129.224796 26
Gyeonggi-do Pyeongtaek-si Gyeonggi-do Pyeongtaek-si 36.992293 127.112709 36.992293 127.112709 22
Incheon Michuhol-gu Incheon Michuhol-gu 37.463572 126.650270 37.463572 126.650270 22
dtype: int64
There are two issues:
Let's add a couple of improvements to help visualization:
animated_paths = True
map_lat_long = folium.Map(
location=[36.0859, 127.9468], zoom_start=6.5, control_scale=True, width=900
)
def jitter(val: float) -> float:
    # spread overlapping points by up to ±0.03 degrees
    low, high = -0.03, 0.03
    return val + (random() * (high - low)) + low
for index, row in infection_vectors.iterrows():
config = {
"locations": [
[jitter(row["infector_latitude"]), jitter(row["infector_longitude"])],
[jitter(row["infected_latitude"]), jitter(row["infected_longitude"])],
],
"weight": 3,
"opacity": 0.3,
}
if animated_paths:
plugins.AntPath(**config).add_to(map_lat_long)
else:
folium.PolyLine(**config).add_to(map_lat_long)
top_hotspots = infection_vectors.value_counts().head(10)
for index, row in top_hotspots.items():
city = index[1]
lat = index[4]
long = index[5]
folium.Circle(
radius=5000, location=[lat, long], color="red", fill_color="red", tooltip=city
).add_to(map_lat_long)
folium_utils.figure(map_lat_long)
Let's review the list of expectations/theories we started with and assess whether they can be rejected or they warrant further investigation.
This is the original list we had:
Strong social distancing policies helped reduce the number of infections ✅ The data shows a sharp drop in confirmed infections once the social distancing policies were implemented.
The relatively slow decline (over a couple of weeks) could be attributed to the virus' inherent incubation period (between 2 and 14 days).
Given the data, we cannot reject this claim, and it warrants further research.
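The drop itself can be checked by differencing the cumulative `confirmed` column into daily new cases and smoothing it; a minimal sketch, assuming Time.csv's cumulative schema and using made-up figures:

```python
import pandas as pd

# Hypothetical excerpt shaped like raw_time: a date index with a cumulative
# `confirmed` column (the numbers below are illustrative, not from the dataset).
time_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-28", "2020-02-29", "2020-03-01", "2020-03-02"]),
    "confirmed": [2300, 3100, 3700, 4200],
}).set_index("date")

# Cumulative totals -> daily new cases; a short rolling mean smooths
# day-to-day reporting noise so the post-policy decline stands out.
daily_new = time_df["confirmed"].diff()
smoothed = daily_new.rolling(window=2, min_periods=1).mean()
print(daily_new)
```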
The elderly population is most at risk of serious illness or death if they contract COVID-19 ✅
Even though the younger generations had the most confirmed cases of infection, the elderly population took the biggest hit in terms of deceased cases.
The situation reaches catastrophic numbers when we look at ratios instead of absolute numbers, with the 80+ age group reaching almost a 25% mortality rate.
These numbers might be biased because we don't know whether the groups we are comparing had similar rates of testing: while we do have country-wide tested/confirmed/deceased counts, we do not have a breakdown of tested cases per age group.
Obtaining this data would help future analyses determine whether this dataset is biased.
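The per-age mortality ratios mentioned above come down to dividing deceased by confirmed counts within each age group; a minimal sketch, assuming TimeAge.csv's `age`/`confirmed`/`deceased` columns and using made-up figures:

```python
import pandas as pd

# Hypothetical snapshot shaped like a single date of raw_timeAge
# (figures are illustrative only).
snapshot = pd.DataFrame({
    "age": ["20s", "40s", "60s", "80s"],
    "confirmed": [2000, 1100, 800, 400],
    "deceased": [0, 3, 30, 98],
}).set_index("age")

# Ratios instead of absolute counts: the mortality skew toward the elderly
# is invisible in confirmed counts alone.
snapshot["mortality_pct"] = 100 * snapshot["deceased"] / snapshot["confirmed"]
print(snapshot["mortality_pct"].round(1))
```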
Young adults are the largest spreaders of confirmed cases ❌
Contrary to what we expected, young people are not the primary vector of infection.
Ignoring the sizeable group of patients missing "infected_by" data, the group with the most infections is the 40-59 age group.
A cross-age analysis showed that most infections occur within the same age group.
Some age groups also show a secondary cross-group infection pattern with groups 30 years apart.
We suspect this reflects Korea-specific demographics and represents the age gap between parents and children (infections within the same family): the average mother's age at first birth in South Korea is around 33.
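The cross-age analysis described above amounts to a contingency table of infector age versus infected age; a minimal sketch with hypothetical pairs (the real notebook derives them by joining patients to their `infected_by` records):

```python
import pandas as pd

# Hypothetical infector/infected age pairs, binned to decades.
pairs = pd.DataFrame({
    "infector_age": ["40s", "40s", "30s", "60s", "30s", "40s"],
    "infected_age": ["40s", "10s", "30s", "30s", "60s", "40s"],
})

# A dominant diagonal means most transmission stays within one age group;
# off-diagonal mass roughly 30 years apart suggests parent/child infections.
matrix = pd.crosstab(pairs["infector_age"], pairs["infected_age"])
print(matrix)
```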
Most of the infections (hotspots) begin around Seoul and adjacent cities ✅
As expected, most of the hotspots included Seoul and adjacent cities.
An in-depth analysis allowed us to identify cross-province and cross-city infections and intra-city clusters, as well as to see point-to-point infections.
This dataset includes South Korean data (population, infection, country statistics, etc...) during the first wave of the COVID-19 pandemic.
Despite its high population density, South Korea managed an outstanding response with very low mortality rates.
This was aided by several factors:
The PatientRoute CSV file was removed from Kaggle due to privacy concerns.